Link to the journal for A3 :https://github.com/bcb420-2022/Jiyun_Won/wiki/A3:-Data-Set-Pathway-and-Network-Analysis
Expression Data set: GSE84054 link to the GEO page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84054
In Assignment 1, the expression data set was read as a dataset in order to filter and normalize. This procedure was completed to have a sorted and simplified dataset for the analysis conducted in assignment 2 and 3.
In Assignment 2, differential gene expression analysis was conducted to identify the mahor up and downregulated pathways. The result demonstrated that both up and down regulated group were mainly associated with metabollic processes. This will be further explored in A3 using non-thresholded analysis and visualization using cytoscape.
if (! requireNamespace("biocManager", quietly = TRUE)){
install.packages("BiocManager", repos = "http://cran.us.r-project.org")
}
##
## The downloaded binary packages are in
## /var/folders/_c/59gypyts1hn8xzqlpwyxvlw80000gn/T//RtmpYUm5gf/downloaded_packages
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("GSA", quietly = TRUE))
BiocManager::install("GSA")
if (!requireNamespace("RCurl", quietly = TRUE))
BiocManager::install("RCurl")
if (! requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (! requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
library(BiocManager)
## Bioconductor version '3.12' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
library(GSA)
## Warning: package 'GSA' was built under R version 4.0.5
library(RCurl)
## Warning: package 'RCurl' was built under R version 4.0.5
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
library(data.table)
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.6.2
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(gprofiler2)
library(edgeR)
## Loading required package: limma
library(circlize)
## Warning: package 'circlize' was built under R version 4.0.5
## ========================================
## circlize version 0.4.14
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
##
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
## in R. Bioinformatics 2014.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(circlize))
## ========================================
library(Biobase)
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:limma':
##
## plotMA
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
# code retrieved from lecture slide
options(timeout = 1000)
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = RCurl:: getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred
# from electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path("~/Desktop/bcb420/projects", gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
### From a1
nor_count <- read.table(file ="GSE84054_RLDNormalizedCount_12patientER.txt.gz", header = TRUE)
dt_type <- data.frame("Type"=1:24)
for (type1 in 1:12){
dt_type$Type[type1] = "Primary Tumour"
}
for (type2 in 13:24){
dt_type$Type[type2] = "Sphere"
}
rownames(dt_type) <- colnames(exp)[2:25]
###
type_map <- nor_count[,3:ncol(nor_count)]
rownames(type_map) <- nor_count$EnsemblGeneID
colnames(type_map) <- rownames(dt_type)
modelDesign <- model.matrix(~ dt_type$Type)
head(modelDesign)
## (Intercept) dt_type$TypeSphere
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
expressionMatrix <- as.matrix(nor_count[, 3:ncol(nor_count)])
rownames(expressionMatrix) <- nor_count$EnsemblGeneID
colnames(expressionMatrix) <- colnames(nor_count)[3:ncol(nor_count)]
minimalSet <- ExpressionSet(assayData = expressionMatrix)
fit <- lmFit(minimalSet, modelDesign)
#Apply empirical Bayes to compute differential expression
fit2 <- eBayes(fit, trend = TRUE) # trend=TRUE specific to RNA-seq data
topfit <- topTable(fit2, coef = ncol(modelDesign),
adjust.method = "BH",
number = nrow(expressionMatrix))
# merge hgnc names to topfit table
output_hits <- merge(nor_count[,1:2], topfit, by.y = 0, by.x = 1, all.y = TRUE)
# sort by p-value
output_hits <- output_hits[order(output_hits$P.Value), ]
### From a1
if (!file.exists("GSE84054_Rawcount_12patientER.txt.gz")){
GSE84054 <- GEOquery::getGEOSuppFiles('GSE84054')
}
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
fnames = rownames(GSE84054)
exp <- read.delim(fnames[2], header=TRUE, check.names = FALSE)
cpms = edgeR::cpm(exp[, 2:25])
keep = rowSums(cpms > 1) >= 3 # Dataset of 3 samples per group
filtered_exp <- exp[keep,]
mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
## Ensembl site unresponsive, trying asia mirror
map_exp <- biomaRt::getBM(filters = "ensembl_gene_id", attributes = c("hgnc_symbol", "ensembl_gene_id"), values = filtered_exp[1], mart = mart)
copy_filter <- filtered_exp
copy_filter <- merge(map_exp, copy_filter, by.x = 2, by.y = 1, all.y = TRUE)
###
mat <- as.matrix(copy_filter[,3:26])
rownames(mat) <- copy_filter$ensembl_gene_id
d = edgeR::DGEList(counts = mat, group = dt_type$Type)
# estimate Dispersion
d <- estimateDisp(d, modelDesign)
fit <- edgeR::glmQLFit(d, modelDesign)
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit, coef = 'dt_type$TypeSphere')
head(qlf.pos_vs_neg)
## An object of class "DGELRT"
## $coefficients
## (Intercept) dt_type$TypeSphere
## ENSG00000000003 -9.644604 0.02426166
## ENSG00000000419 -10.142654 0.50644680
## ENSG00000000457 -11.227809 -0.21846437
## ENSG00000000460 -12.010001 -0.20673500
## ENSG00000000938 -12.264641 -3.06816274
## ENSG00000000971 -9.705494 -3.93967286
##
## $fitted.values
## RHB037 RHB038 RHB041 RHB049 RHB050 RHB052
## ENSG00000000003 1197.69922 1375.9316 1295.7414 394.91545 1102.50688 1195.22286
## ENSG00000000419 727.81894 836.1274 787.3973 239.98257 669.97237 726.31410
## ENSG00000000457 245.82498 282.4068 265.9479 81.05548 226.28698 245.31672
## ENSG00000000460 112.38492 129.1092 121.5846 37.05650 103.45264 112.15255
## ENSG00000000938 87.09667 100.0577 94.2263 28.71824 80.17428 86.91659
## ENSG00000000971 1126.94072 1294.6434 1219.1907 371.58436 1037.37222 1124.61066
## RHB053 RHB350 RHB351 RHB353 RHB354
## ENSG00000000003 1184.23584 1505.0803 1106.28237 1331.31927 1200.10116
## ENSG00000000419 719.63750 914.6086 672.26667 809.01729 729.27855
## ENSG00000000457 243.06166 308.9142 227.06189 273.25019 246.31798
## ENSG00000000460 111.12160 141.2277 103.80691 124.92303 112.61031
## ENSG00000000938 86.11761 109.4494 80.44884 96.81352 87.27134
## ENSG00000000971 1114.27274 1416.1621 1040.92466 1252.66668 1129.20076
## RHB355 RHB031 RHB032 RHB035 RHB043
## ENSG00000000003 1152.09468 1564.26057 1563.284160 1553.259322 1745.820903
## ENSG00000000419 700.10593 1539.63780 1538.676762 1528.809723 1718.340224
## ENSG00000000457 236.46476 251.84159 251.684386 250.070416 281.072293
## ENSG00000000460 108.10567 116.48136 116.408649 115.662158 130.001096
## ENSG00000000938 83.78031 5.03736 5.034216 5.001933 5.622035
## ENSG00000000971 1084.03044 27.81928 27.801919 27.623634 31.048208
## RHB044 RHB046 RHB047 RHB344 RHB345
## ENSG00000000003 1592.230765 1621.769361 1751.834651 1967.842710 2075.471577
## ENSG00000000419 1567.167723 1596.241358 1724.259311 1936.867223 2042.801922
## ENSG00000000457 256.344709 261.100341 282.040489 316.817184 334.145131
## ENSG00000000460 118.564134 120.763701 130.448904 146.533764 154.548258
## ENSG00000000938 5.127432 5.222554 5.641401 6.337008 6.683603
## ENSG00000000971 28.316714 28.842038 31.155158 34.996711 36.910815
## RHB347 RHB348 RHB349
## ENSG00000000003 1882.070418 1963.532738 2060.08066
## ENSG00000000419 1852.445058 1932.625093 2027.65327
## ENSG00000000457 303.008084 316.123290 331.66724
## ENSG00000000460 140.146802 146.212826 153.40219
## ENSG00000000938 6.060797 6.323129 6.63404
## ENSG00000000971 33.471311 34.920061 36.63710
##
## $deviance
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
## 15.805069 10.822281 6.789848 5.987365 38.422946
## ENSG00000000971
## 98.649347
##
## $method
## [1] "oneway"
##
## $unshrunk.coefficients
## (Intercept) dt_type$TypeSphere
## ENSG00000000003 -9.644690 0.02426371
## ENSG00000000419 -10.142795 0.50650305
## ENSG00000000457 -11.228228 -0.21856670
## ENSG00000000460 -12.010918 -0.20694567
## ENSG00000000938 -12.265829 -3.09288363
## ENSG00000000971 -9.705586 -3.94427959
##
## $df.residual
## [1] 22 22 22 22 22 22
##
## $design
## (Intercept) dt_type$TypeSphere
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 19 more rows ...
##
## $offset
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 16.73285 16.87158 16.81153 15.62336 16.65003 16.73078 16.72154 16.96129
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 16.65345 16.83862 16.73485 16.69403 16.97559 16.97497 16.96854 17.08541
## [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 16.99332 17.0117 17.08885 17.20512 17.25837 17.16055 17.20293 17.25093
## attr(,"class")
## [1] "CompressedMatrix"
## attr(,"Dims")
## [1] 6 24
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
##
## $dispersion
## [1] 0.6553415 0.6506241 0.8504457 1.1639535 1.5643274 0.6586076
##
## $prior.count
## [1] 0.125
##
## $AveLogCPM
## [1] 6.036774 5.712271 3.594160 2.484793 1.337520 4.960519
##
## $df.residual.zeros
## [1] 22 22 22 22 22 22
##
## $df.prior
## [1] 5.294182
##
## $var.post
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
## 0.6966812 0.5151580 0.3828555 0.3675714 1.5710027
## ENSG00000000971
## 3.7365814
##
## $var.prior
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
## 0.6063778 0.6117160 0.6913020 0.7640833 0.8417328
## ENSG00000000971
## 0.6304249
##
## $samples
## group lib.size norm.factors
## RHB037 Primary Tumour 18491977 1
## RHB038 Primary Tumour 21243811 1
## RHB041 Primary Tumour 20005708 1
## RHB049 Primary Tumour 6097330 1
## RHB050 Primary Tumour 17022247 1
## 19 more rows ...
##
## $table
## logFC logCPM F PValue
## ENSG00000000003 0.03500217 6.036774 0.007727913 9.305912e-01
## ENSG00000000419 0.73064829 5.712271 4.536961562 4.233188e-02
## ENSG00000000457 -0.31517746 3.594160 0.874353891 3.579526e-01
## ENSG00000000460 -0.29825555 2.484793 0.594870312 4.471681e-01
## ENSG00000000938 -4.42642317 1.337520 17.013091531 3.126372e-04
## ENSG00000000971 -5.68374650 4.960519 25.107247559 2.876932e-05
##
## $comparison
## [1] "dt_type$TypeSphere"
##
## $df.test
## [1] 1 1 1 1 1 1
##
## $df.total
## [1] 27.29418 27.29418 27.29418 27.29418 27.29418 27.29418
#get all the results
qlf_output_hits <- topTags(qlf.pos_vs_neg, sort.by = "PValue", n = nrow(nor_count))
# merge gene names with the top hits
qlf_output_hits_withgn <- merge(filtered_exp[,1:2], qlf_output_hits, by.x = 1, by.y = 0)
head(qlf_output_hits_withgn)
## Var.1 RHB037 logFC logCPM F PValue
## 1 ENSG00000000003 2561 0.03500217 6.036774 0.007727913 9.305912e-01
## 2 ENSG00000000419 1170 0.73064829 5.712271 4.536961562 4.233188e-02
## 3 ENSG00000000457 387 -0.31517746 3.594160 0.874353891 3.579526e-01
## 4 ENSG00000000460 217 -0.29825555 2.484793 0.594870312 4.471681e-01
## 5 ENSG00000000938 45 -4.42642317 1.337520 17.013091531 3.126372e-04
## 6 ENSG00000000971 76 -5.68374650 4.960519 25.107247559 2.876932e-05
## FDR
## 1 0.9598094910
## 2 0.1207295397
## 3 0.5235473730
## 4 0.6064154567
## 5 0.0037633875
## 6 0.0006839251
# Lecture 7
df <- output_hits[output_hits$EnsemblGeneID %in% qlf_output_hits_withgn$Var.1,]
pVal <- df$P.Value
lFC <- df$logFC
df[,"rank"] <- log(pVal, base = 10)*sign(lFC)
df <- df[order(df$rank),]
write.table(x=subset(df, TRUE, c(symbol, rank)),
file="rankedRNAseq.rnk", sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
From many different gene set enrichment analysis tools, GSEA is used to conduct this analysis. GSEA (version 4.2.3) is downloaded as an application for the most up-to-date access to the tool and friendly GUI. The bader lab gene set and ranked geneset are used to run GSEA. Bader lab dataset is downloaded and ranked gene set are computed as above. These datasets are loaded on to the GSEA application. Then, I ran GSEA Pre-Ranked after adjusting the parameters as: * Number of Permutations: 1000 * No_Collapse * Max Size: 200 * Min Size: 15
2. Summarize your enrichment results
Overall Report: file:///Users/karenwon/gsea_home/output/apr06/my_analysis.GseaPreranked.1649226188805/index.html
After running the analysis with max and min limit of 200 and 15, 13149 gene sets were filtered out with the remaining 5387 used in the analysis.
The report is specified in two features, gene sets upregulated in phenotype na_pos and gene sets upregulated in phenotype na_neg.
na_pos
na_neg
## Compare to A2 In A2 the thresholded analysis was conducted using g:profiler. According to the result, both up regulated and down regulated terms that were shown to be the major role were both relevant to metabollic processes. However, although the doenregulated group seem tobe similar, the non-thresholded analysis conducted using GSEA seem to demonstrate immune response related major terms for upregulated set. The two analysis indicates that the disease consist of interaction of both immune response and metabolic degradation which supports the paper and present its association with breast cancer.
The Enrichment Analysis result is visualized using the tool Cytoscape which presents the connection of genesets in common with edges and nodes. The Enrichment Map[@Enrichment] application is downloaded within the Cytoscape platform to easily use GSEA analysis result files.
The parameter settings to create an enrichment map are set as the image below: * FDR Q-Value: 0.05 * P-value: 1.0 * Edge Cutoff: 0.375
Initial enrichment map creation as an image:
There are 1170 Nodes and 7142 edges presented in this map.
The AutoAnnotate[@Autoannotate] application is used to annotate the network. The settings are adjusted as the image below: The creation of annotation set will then create clusters of nodes indicating the frequency. We could see that there are a few large clusters of connected gene sets at the top of the map and several small clusters in the surroundings of the large sets. There are seperate genesets presented vertically at the bottom of the map.
## Edit the map for visualization To Improve the presentation of the map the FDR cutOff is restricted further to 0.01. This results in 646 Nodes and 33395 Edges. The map after auto annotate result in much simpler diagram.
To further organize and improve the presentation of the map, the “prevent cluster overlap” option is selected when setting up the auto annotate parameters.
Then, the resulting map is the image below:
# Publication ready figure / Collapse network to a theme network The Enrichment Map is simplified with major theme clusters. Some largest themes include Immune Regulation immunity, Immune Response cell, Regulation cell differentiation, and apc g1 degradation. There seem to be no clear novel pathways.
According to the map the up-regulated genes reveal the most interaction with the immune response cell and it is connected to other major themes such as immune regulation, regulation cell differentiation, inferon gamma negative, etc. These main terms support the conclusion of the original paper as the immune system cells induces the regulation of the cancer growth. As high Immunoglobuline (IG), cancer-derived Ig are known to be present in cancer cells including carcinomas of breast, this also supports the reason why the major terms are present in the map and justifies the original paper. The major down-regulated term demonstrated in the map is the APC degradation. The APC plays a role in suppressing a tumor so the degradation suggests that the suppressor is inadequate in preventing the cancer cell growth. These similar terms (such as immune response) were also predicted using g:profiler in A2. Therefore, we can conclude that the major causation of the disease in this case is due to immune response malfunction.
Research paper that support the result concluded in this study could be (Nestler et al., 2022) The paper states that the pathways relevant to immune system processes are highly visible and that most of the differentially expressed genes are also the immune related. Therefore concluded that the role of immune processes are significant in tumours. Another paper (Jin et al., 2021) adds to the evidence that their study revealed the immune cells (specifically TME immune cell in this case) functionally affected the progression of breast cancer. THe paper suggests that the TME of breast cancer is involved in multiple immunosuppression, which interacts with tumor cell metastasis.